Combining Index Structures for application-specific String Similarity Predicates
Authors
Abstract
This paper presents new approaches to string similarity matching that combine techniques from information technology and computational linguistics to improve both accuracy and efficiency. Homogenizing plain text reduces the size of the index structures and, at the same time, improves the quality of the hit lists. The paper also shows how abbreviations, acronyms, and synonyms can be handled carefully and in a context-dependent way. The core of this work is a general approach for stepwise, application-specific, index-supported string similarity predicates. It uses the Hybrid-Ternary Search Trie, one of the fastest index structures for strings. Tries give the best results for exact and inexact matching on preprocessed data and can be used for external data storage. Hybrid-Ternary Search Tries in particular are easily adapted to common strings and provide the best average results for inexact matching without any limitations.
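As a rough illustration of the two ideas named in the abstract, the following Python sketch homogenizes text before indexing it in a ternary search trie and then runs an inexact lookup. It is not the authors' implementation: it uses a plain ternary search trie rather than the hybrid variant, the inexact match is restricted to substitutions (Hamming distance) on equal-length keys, and the names `homogenize`, `ABBREVIATIONS`, `TernarySearchTrie`, and `near` are all hypothetical. The abbreviation table is context-free, which is simpler than the context-dependent handling the abstract describes.

```python
import unicodedata

# Hypothetical abbreviation table; a real application would supply its own
# and would resolve ambiguous abbreviations from context.
ABBREVIATIONS = {"univ.": "university", "dept.": "department"}


def homogenize(text):
    """Lowercase, strip diacritics, expand known abbreviations, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in stripped.split())


class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False


class TernarySearchTrie:
    """Plain ternary search trie (assumption: not the paper's hybrid variant)."""

    def __init__(self):
        self.root = None

    def insert(self, key):
        if key:
            self.root = self._insert(self.root, key, 0)

    def _insert(self, node, key, i):
        ch = key[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, key, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, key, i)
        elif i + 1 < len(key):
            node.eq = self._insert(node.eq, key, i + 1)
        else:
            node.is_end = True
        return node

    def near(self, query, max_mismatches):
        """Stored keys with len(query) characters and at most max_mismatches substitutions."""
        results = []
        self._near(self.root, query, 0, max_mismatches, [], results)
        return sorted(results)

    def _near(self, node, query, i, budget, prefix, results):
        if node is None or i >= len(query):
            return
        # Visit the smaller-character subtree only if it can still match.
        if budget > 0 or query[i] < node.ch:
            self._near(node.lo, query, i, budget, prefix, results)
        cost = 0 if query[i] == node.ch else 1
        if budget - cost >= 0:
            prefix.append(node.ch)
            if i + 1 == len(query):
                if node.is_end:
                    results.append("".join(prefix))
            else:
                self._near(node.eq, query, i + 1, budget - cost, prefix, results)
            prefix.pop()
        # Same pruning rule for the larger-character subtree.
        if budget > 0 or query[i] > node.ch:
            self._near(node.hi, query, i, budget, prefix, results)


trie = TernarySearchTrie()
for raw in ["Universität", "University", "universal"]:
    trie.insert(homogenize(raw))

matches = trie.near("universitx", 1)
```

Because every key is homogenized before insertion, spelling variants that differ only in case or diacritics collapse into one entry, which is the index-size reduction the abstract refers to. Queries should be passed through the same `homogenize` function before lookup.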
Similar Resources
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database...
Supporting Similarity Operations Based on Approximate String Matching on the Web
Querying and integrating sources of structured data from the Web in most cases requires similarity-based concepts to deal with data level conflicts. This is due to the often erroneous and imprecise nature of the data and diverging conventions for their representation. On the other hand, Web databases offer only limited interfaces and almost no support for similarity queries. The approach presen...
Bridging Database Applications and Declarative Similarity Matching
Effective manipulation of string data is of fundamental importance to modern database applications. Very often, textual inconsistencies render equality comparisons meaningless and strings have to be matched in terms of their similarity. Previous work has proposed techniques to express similarity operations using declarative SQL statements. However, the non-trivial issue of embedding similarity ...
Selectivity Estimation for Fuzzy String Predicates in Large Data Sets
Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964.” Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimati...
Weighted Set-Based String Similarity
Consider a universe of tokens, each of which is associated with a weight, and a database consisting of strings that can be represented as subsets of these tokens. Given a query string, also represented as a set of tokens, a weighted string similarity query identifies all strings in the database whose similarity to the query is larger than a user specified threshold. Weighted string similarity q...